
Frontiers in Artificial Intelligence

Frontiers Media SA

All preprints, ranked by how well they match Frontiers in Artificial Intelligence's content profile, based on 18 papers previously published here. The average preprint has a 0.03% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.

1
Machine learning-driven prediction of opioid and stimulant-related drug overdose fatalities: Analysis of the potential fourth wave

Eze, C. D.; Hansen, R.; Abate, M.; Smith, G.; Al-Mamun, M. A.

2025-12-11 addiction medicine 10.64898/2025.12.09.25341941 medRxiv
Top 0.1%
34.8%

Between 2010 and 2021, deaths co-involving fentanyl and stimulants increased from 0.6% to 32.3% of all overdose deaths in the U.S. The Centers for Disease Control and Prevention monitors overdose deaths, but reports are delayed by about 4-6 months. Therefore, advanced methods are needed for optimized trend monitoring and preparing the healthcare system. We developed and compared traditional and machine learning (ML)-based time series prediction models for forecasting opioid- and stimulant-involved death rates. Forensic research data (2015 to 2023) built from the West Virginia (WV) Office of the Chief Medical Examiner data were used for this study. Decedents with any opioid- or stimulant-involved death were identified and placed into three cohorts: opioid-only, stimulant-only, and opioid and stimulant co-involved deaths. Monthly death rates per 100,000 were calculated for each cohort using total cases per month and West Virginia population data from the Census Bureau. Autoregressive Integrated Moving Average (ARIMA), Random Forest (RF), and Extreme Gradient Boosting (XGBoost) variant (differenced, non-differenced, and blended) models were trained on 80% of each cohort's time-ordered data. An iterative forecast of the 20% testing data was conducted. Model performance on the test predictions was evaluated using metrics such as root mean square error (RMSE), R2, mean absolute error (MAE), and mean absolute percentage error (MAPE). Counts and percentages of cases per year were obtained for each cohort. Death rates and model predictions were represented as time series, and model performance for each cohort was compared using these metrics. A total of 10,812 cases were identified from 2015 to 2023, with 4,295 involving opioids only, 1,392 involving stimulants only, and 4,175 co-involving an opioid and a stimulant.
Stimulant-only and opioid and stimulant co-involved death rates had an upward trend, with a peak in opioid and stimulant co-involved deaths in 2021. Although the opioid-only death rate had a downward trend over time, it peaked in 2020. The non-differenced XGBoost model performed best for opioid-only (R2 = 0.92, RMSE = 0.12, MAE = 0.10, MAPE = 6.59%) and stimulant-only (R2 = 0.91, RMSE = 0.07, MAE = 0.06, MAPE = 7.35%) death rate prediction. The blended XGBoost model had the best performance for opioid and stimulant co-involved death rate prediction (R2 = 0.78, RMSE = 0.31, MAE = 0.27, MAPE = 8.87%). Differenced XGBoost models outperformed other models for short-term forecasting, while the non-differenced variants performed better for long-term predictions. Machine learning models, especially the XGBoost variants, outperformed other models for predicting opioid-only, stimulant-only, and opioid and stimulant co-involved death rates. The differenced models can be used for early death-rate signal detection, while the non-differenced XGBoost models can aid long-term forecasts for overdose death monitoring, planning, and allocation of resources in health systems. Author summary: The United States has faced the problem of opioid abuse and overdose death for several decades. Currently, there is a rise in drug overdose deaths co-involving an opioid and a stimulant. Although the CDC monitors and produces a provisional overdose death count, this report is often delayed by 4-6 months. There is a need for a high-accuracy predictive tool that can yield reliable forecasts of these overdose deaths to guide policy decisions and avoid the delay. Here we developed and compared machine learning (Extreme Gradient Boosting (XGBoost) and Random Forest) models against a traditional statistical (ARIMA) forecasting model for predicting overdose death rates involving an opioid, a stimulant, or both.
We found that the XGBoost models made better predictions than ARIMA and Random Forest. Our study provides a tool that can be used to predict future overdose deaths and can inform health systems and communities so they can better respond to overdose deaths and develop prevention policies, especially for opioid and stimulant overdoses.
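The differenced-versus-non-differenced scheme this abstract describes can be sketched as an iterative one-step-ahead forecasting loop: train on the first 80% of a monthly series, then forecast the remaining 20% one step at a time, feeding each prediction back in as a lag. The snippet below is a minimal illustration only; a naive lag-mean predictor stands in for the fitted XGBoost models, and the series values are made up, not the study's data.

```python
def lag_mean_model(lags):
    """Stand-in for a fitted regressor: predict the mean of the lag window."""
    return sum(lags) / len(lags)

def iterative_forecast(series, train_frac=0.8, n_lags=3, differenced=True):
    """Iteratively forecast the held-out tail of a time series.

    If differenced, the model works on month-to-month changes (a more
    stationary signal) and predictions are cumulated back to levels.
    """
    split = int(len(series) * train_frac)
    history = list(series[:split])
    if differenced:
        work = [history[i] - history[i - 1] for i in range(1, len(history))]
    else:
        work = list(history)
    preds = []
    last_level = history[-1]
    for _ in range(len(series) - split):
        step = lag_mean_model(work[-n_lags:])
        if differenced:
            last_level = last_level + step   # cumulate the predicted change
            preds.append(last_level)
        else:
            preds.append(step)
        work.append(step)                    # feed prediction back in as a lag
    return preds

# Toy monthly death-rate series (illustrative values)
rates = [1.0, 1.2, 1.1, 1.4, 1.5, 1.7, 1.6, 1.9, 2.0, 2.2]
print(iterative_forecast(rates, differenced=True))
```

The differenced variant reacts to recent changes (useful for early signal detection), while the non-differenced variant tracks the level directly, mirroring the short-term/long-term split reported above.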

2
Artificial Intelligence for Predicting Treatment Adherence in Opioid Use Disorder: A Scoping Review

Wasim, A.; Abud, A.; Malick, S.; Nashrah, N. A.; Losanova, V.; Raimugia, H.; Karim, M.; Le, M. V. K.; Baskarathasan, V.; Gibson, C.; Alkhatib, S.; Martins, S. S.

2025-07-11 addiction medicine 10.1101/2025.07.10.25331331 medRxiv
Top 0.1%
18.9%

Background and Aim: Opioid use disorder (OUD) is a chronic condition in which an individual engages in the persistent use of opioids that causes significant distress and negatively impacts their societal functioning. Treatment for OUD involves pharmacological therapies such as methadone, buprenorphine, and naltrexone, typically used in combination with behavioral interventions such as counselling and cognitive behavioural therapy. However, non-adherence to OUD treatment is high, potentially leading to negative outcomes like relapse and increased risk of overdose. Therefore, identifying patients at risk of treatment non-adherence is essential to ensure that OUD is adequately managed. Models utilizing AI and ML techniques have emerged as promising candidates to achieve risk stratification in this patient population. We conducted a scoping review to capture and systematically map existing literature on AI and ML applications predicting adherence to treatment in individuals with OUD. Methods: Ovid MEDLINE, Embase, PsycINFO, Web of Science, Scopus, CINAHL, IEEE Xplore, and ACM Digital Library were searched to identify peer-reviewed empirical research articles published from inception to October 7, 2024. Twenty-two studies were included in the review. Results: All studies that matched our search criteria were published after 2018 and were predominantly conducted in the United States. Random forest models were frequently identified as the top performer, although significant variability in algorithms, evaluation metrics, and key predictors was noted in the literature. Conclusion: The need for future research to cover more geographical locations, diversify patient populations, focus on a standardized group of models and outcomes, and utilize larger samples was highlighted.

3
Multiple Cost Optimisation for Alzheimer's Disease Diagnosis

McCombe, N.; Ding, X.; Prasad, G.; Finn, D. P.; Todd, S.; McClean, P. L.; Wong-Lin, K.

2022-04-16 neurology 10.1101/2022.04.10.22273666 medRxiv
Top 0.1%
18.6%

Current machine learning techniques for dementia diagnosis often do not take into account real-world practical constraints, such as the cost of diagnostic assessment time and financial budgets. In this work, we built on previous cost-sensitive feature selection approaches by generalising to multiple cost types, while taking into consideration that stakeholders attempting to optimise the dementia care pathway might face multiple non-fungible budget constraints. Our new optimisation algorithm involved searching cost-weighting hyperparameters while constrained by total budgets. We then provided a proof of concept using both assessment-time and financial budget costs. We showed that budget constraints could control the feature selection process in an intuitive and practical manner, while adjusting the hyperparameter increased the range of solutions selected by feature selection. We further showed that our budget-constrained cost optimisation framework could be implemented in a user-friendly graphical user interface sandbox tool to encourage non-technical users and stakeholders to adopt, further explore, and audit the model - a human-in-the-loop approach. Overall, we suggest that setting budget constraints initially and then fine-tuning the cost-weighting hyperparameters can be an effective way to perform feature selection where multiple cost constraints exist, which will in turn lead to more realistic optimisation and redesign of dementia diagnostic assessments. Clinical Relevance: By optimising diagnostic accuracy against various costs (e.g. assessment administration time and financial budget), predictive yet practical dementia diagnostic assessments can be redesigned to suit clinical use.
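A toy version of the multi-cost idea can be sketched as a greedy selection under two non-fungible budgets. This is not the authors' algorithm (which searches cost-weighting hyperparameters under total budget constraints); the feature names, costs, and the gain-per-weighted-cost heuristic below are illustrative assumptions.

```python
def select_features(candidates, time_budget, money_budget, w=0.5):
    """Greedy sketch of multi-cost feature selection.

    Rank candidate features by predictive gain divided by a cost-weighted
    combination of two non-fungible costs (time and money), then add
    features as long as BOTH budgets are respected.
    """
    chosen, t_used, m_used = [], 0.0, 0.0
    pool = sorted(candidates,
                  key=lambda f: f["gain"] / (w * f["time"] + (1 - w) * f["money"]),
                  reverse=True)
    for f in pool:
        if t_used + f["time"] <= time_budget and m_used + f["money"] <= money_budget:
            chosen.append(f["name"])
            t_used += f["time"]
            m_used += f["money"]
    return chosen

# Hypothetical assessments: (gain, minutes of clinician time, cost in currency)
cands = [{"name": "mmse", "gain": 0.9, "time": 10, "money": 0},
         {"name": "mri", "gain": 0.8, "time": 30, "money": 500},
         {"name": "blood", "gain": 0.3, "time": 5, "money": 50}]
print(select_features(cands, time_budget=20, money_budget=100))
```

Varying `w` here plays the role of the cost-weighting hyperparameter in the abstract: it shifts which cost dominates the ranking while the hard budgets still cap the final selection.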

4
Performance drift is a major barrier to the safe use of machine learning in cardiac surgery

Dong, T.; Sinha, S.; Zhai, B.; Fudulu, D. P.; Chan, J.; Narayan, P.; Judge, A.; Caputo, M.; Dimagli, A.; Benedetto, U.; Angelini, G. D.

2023-01-22 health informatics 10.1101/2023.01.21.23284795 medRxiv
Top 0.1%
18.3%

Objectives: The Society of Thoracic Surgeons (STS) and EuroSCORE II (ES II) risk scores are the most commonly used risk prediction models for post-operative in-hospital mortality in adult cardiac surgery. However, they are prone to miscalibration over time and poor generalisation across datasets, and their use remains controversial. It has been suggested that using Machine Learning (ML) techniques, a branch of Artificial Intelligence (AI), may improve the accuracy of risk prediction. Despite increased interest, a gap in understanding the effect of dataset drift on the performance of ML over time remains a barrier to its wider use in clinical practice. Dataset drift occurs when a machine learning system underperforms because of a mismatch between the data on which it was developed and the data on which it is deployed. Here we analyse this potential concern in a large United Kingdom (UK) database. Methods: A retrospective analysis of prospectively, routinely gathered data on adult patients undergoing cardiac surgery in the UK between 2012 and 2019. We temporally split the data 70:30 into training and validation subsets. ES II and five ML mortality prediction models were assessed for relationships between and within variable importance drift, performance drift, and actual dataset drift, using temporal and non-temporal invariant consensus scoring that combines the geometric average of all metrics into a Clinical Effective Metric (CEM). Results: A total of 227,087 adults underwent cardiac surgery during the study period, with a mortality rate of 2.76%. There was strong evidence of a decrease in overall performance across all models (p < 0.0001). XGBoost (CEM 0.728, 95% CI: 0.728-0.729) and Random Forest (CEM 0.727, 95% CI: 0.727-0.728) were the best-performing models both temporally and non-temporally. ES II performed worst across all comparisons.
Sharp changes in variable importance and dataset drift between 2017-10 and 2017-12, 2018-06 and 2018-07, and 2018-12 and 2019-02 mirrored the performance decrease across models. Conclusions: Combining metrics covering the four aspects of discrimination, calibration, clinical usefulness, and overall accuracy into a single consensus metric improved the efficiency of cognitive decision-making. All models showed a decrease in at least 3 of the 5 individual metrics. CEM and variable importance drift detection demonstrate the limitations of the logistic regression methods used for cardiac surgery risk prediction and the effects of dataset drift. Future work will be required to determine the interplay between ML models and whether ensemble models could take advantage of their respective performance advantages. Central message: ML performance decreases over time due to dataset drift but remains superior to ES II; therefore, regular assessment and modification of ML models may be preferable. Prospective message: A gap in understanding the effect of dataset drift on the performance of ML models over time presents a major barrier to their clinical application. XGBoost and Random Forest have shown superior performance both temporally and non-temporally against ES II. However, a decrease in the performance of all models due to dataset drift suggests the need for regular drift monitoring.
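The consensus-scoring idea above, combining several performance metrics via their geometric average into one CEM-style score, reduces to a few lines. The metric names and values below are made up for illustration; the study's actual CEM covers discrimination, calibration, clinical usefulness, and overall accuracy.

```python
import math

def consensus_metric(metrics):
    """Geometric mean of several performance metrics in (0, 1].

    Computed in log space for numerical stability; the geometric mean
    penalizes a model that scores poorly on any single aspect more
    heavily than an arithmetic mean would.
    """
    assert all(0 < m <= 1 for m in metrics.values()), "metrics must be in (0, 1]"
    logs = [math.log(m) for m in metrics.values()]
    return math.exp(sum(logs) / len(logs))

# Hypothetical per-aspect scores for one model
scores = {"discrimination": 0.85, "calibration": 0.78,
          "clinical_usefulness": 0.70, "overall_accuracy": 0.66}
print(round(consensus_metric(scores), 3))
```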

5
A performance evaluation of neural network features and functions settings on the model accuracy

Bozdech, M.

2022-11-30 scientific communication and education 10.1101/2022.11.28.518263 medRxiv
Top 0.1%
14.1%

Not only in sports, the neural network is the most used type of artificial intelligence. With modern software, anyone can create a neural network model, but little is known about how to prepare the data and how to set up the model's algorithms for maximum performance. For these reasons, this study aims to determine whether feature settings or function settings have a greater effect on model accuracy. An initial feature dataset (n = 18,882) was obtained from publicly available sources. Each of the six different feature settings consisted of 96 models. A total of 384 models were created, for which testing accuracy and the percentage difference between the training and testing phases were further analyzed. No statistically significant differences were found between the accuracies of the function settings, but statistically significant differences were confirmed between the feature settings. The study found that feature settings, especially reducing the number of outputs, are a more important factor in increasing model accuracy than function settings, even though the literature focuses more on function settings and treats feature settings merely as one way to improve the model.

6
The Independence of Discrimination and Calibration in Clinical Risk Prediction: Lessons from a Multi-Timeframe Diabetes Prediction Framework

O'Reilly, E.; Kurakovas, T.

2026-02-14 health informatics 10.64898/2026.02.12.26346147 medRxiv
Top 0.1%
13.8%

Background: Clinical risk prediction models are typically evaluated by discrimination (area under the receiver operating characteristic curve, AUC), with calibration receiving less attention. We developed a multi-timeframe diabetes prediction framework emphasizing calibration and used synthetic-data validation to investigate whether good discrimination guarantees good calibration. Methods: We generated 500,000 synthetic patients using published epidemiological parameters from QDiabetes-2018, FINDRISC, and the Diabetes Prevention Program. The framework comprises a discrete-time survival ensemble with isotonic calibration, producing predictions at 6, 12, 24, and 36 months with bootstrap confidence intervals. We evaluated discrimination (AUC), bin-level calibration (expected calibration error, ECE), calibration-in-the-large (observed-to-expected ratio), and clinical utility (decision curve analysis). We compared performance against QDiabetes-2018 implemented on the same synthetic cohort. Results: Despite achieving excellent discrimination (AUC = 0.844, 95% CI: 0.840-0.848) and low bin-level calibration error (ECE = 0.006), the framework systematically overpredicted risk by 50%: mean predicted probability was 8.4% versus an observed rate of 5.6% (observed-to-expected ratio = 0.66, 95% CI: 0.65-0.67). This miscalibration occurred despite isotonic regression on a held-out calibration set. Overprediction was present in 9 of 10 risk deciles. Risk stratification remained valid (23.5-fold separation, 95% CI: 22.8-24.3, between highest and lowest tiers), confirming that discrimination was preserved. QDiabetes-2018 achieved comparable discrimination (AUC = 0.831) with better calibration (O:E = 0.89). Decision curve analysis showed net benefit across the 5-30% threshold range, though recalibration would improve clinical utility. Conclusions: Good discrimination does not guarantee good calibration.
Our primary finding is negative: isotonic calibration failed to produce well-calibrated predictions even on synthetic data from a single generator. This has important implications for model deployment, where distribution shift is inevitable. We recommend that prediction model studies report calibration-in-the-large alongside bin-level metrics, as ECE alone can be misleading when risk distributions are skewed. Recalibration on deployment populations will likely be necessary for any prediction model, regardless of development-phase calibration performance. Key Messages
What is already known
- Clinical prediction models require both discrimination (ranking patients correctly) and calibration (accurate probability estimates)
- Isotonic regression is a recommended approach for post-hoc calibration
- Expected calibration error (ECE) is commonly reported as a summary calibration metric
What this study adds
- Demonstrates empirically that excellent discrimination (AUC = 0.844) can coexist with substantial miscalibration (50% overprediction)
- Shows that low ECE can be misleading when most patients fall in low-risk deciles
- Provides evidence that isotonic calibration on held-out data may not generalize even within synthetic data from one generator
- Demonstrates a discrete-time survival architecture that reduces monotonicity violations to <0.1%
How this study might affect research, practice, or policy
- Prediction model studies should report calibration-in-the-large (O:E ratio) alongside ECE
- Developers should expect recalibration to be necessary when deploying to new populations
- Claims of calibrated prediction should be viewed skeptically without comprehensive calibration assessment
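The two calibration views this abstract contrasts, calibration-in-the-large (O:E ratio) and bin-level calibration (ECE), can each be computed in a few lines. The data below are toy values, not the study's cohort.

```python
def observed_to_expected(y_true, y_prob):
    """Calibration-in-the-large: observed event rate divided by mean
    predicted risk. O:E < 1 means the model overpredicts on average."""
    return (sum(y_true) / len(y_true)) / (sum(y_prob) / len(y_prob))

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Bin-level calibration: size-weighted mean |observed - predicted|
    over equal-width risk bins."""
    bins = [[] for _ in range(n_bins)]
    for t, p in zip(y_true, y_prob):
        bins[min(int(p * n_bins), n_bins - 1)].append((t, p))
    total = 0.0
    for b in bins:
        if b:
            obs = sum(t for t, _ in b) / len(b)
            pred = sum(p for _, p in b) / len(b)
            total += (len(b) / len(y_true)) * abs(obs - pred)
    return total

# Toy cohort: one patient in four has the event, but every risk is 0.45
y_true = [1, 0, 0, 0]
y_prob = [0.45, 0.45, 0.45, 0.45]
print(observed_to_expected(y_true, y_prob), expected_calibration_error(y_true, y_prob))
```

Reporting both, as the abstract recommends, guards against cases where one summary looks reassuring while the other exposes systematic over- or underprediction.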

7
Using System Dynamics Modeling To Assess The Impact Of Connecticut Good Samaritan Laws: A Protocol Paper

Ali, S. S.; Sabounchi, N. S.; Heimer, R.; D'Onofrio, G.; Violette, C.; LaWall, K.; Heckmann, R.

2022-01-06 addiction medicine 10.1101/2022.01.06.22268677 medRxiv
Top 0.1%
12.9%

Background: We applied a participatory system dynamics (SD) modeling approach to evaluate the effectiveness and impact of Connecticut's Good Samaritan Laws (GSLs), which are designed to promote bystander intervention during an opioid overdose event and reduce opioid overdose-related adverse outcomes. Our SD model can be used to predict whether additional revisions of the statutes might make GSLs more effective. SD modeling is a novel approach for assessing the impact of GSLs; in this protocol paper, we describe its applicability to our policy question, as well as the expected outcomes of this approach. Methods: This project began in February 2021 and is expected to conclude by March 2022. During this time, a total of six group model-building (GMB) sessions will have been held with key stakeholders to elicit feedback that will, in turn, contribute to the development of a more robust SD model. Session participants include bystanders who witness an overdose, law enforcement personnel, first responders, pharmacists, physicians, and other health care professionals who work in at least two major metropolitan areas of Connecticut (New Haven and Hartford). Due to the restrictions imposed by the COVID-19 pandemic, the sessions are being held virtually via Zoom. The information obtained during these sessions will be integrated with a draft SD model that has already been developed by the modeling team as part of a previous CDC-funded project. Model calibration and policy simulations will then be performed to assess the impact of the current GSLs and to make recommendations for future public policy changes. Discussion: An SD modeling approach enables capture of complex interrelationships among multiple health outcomes to better assess the drivers of the opioid epidemic in Connecticut. The model's simulation results are expected not only to align with current real-world data but also to recreate historical trends and infer future trends in a situationally relevant fashion.
This will facilitate the work of policy makers who are devising and implementing time-sensitive changes to address opioid overdose-related deaths at the state level. Our approach, as described, can be replicated to make similar improvements in other jurisdictions. CONTRIBUTIONS TO THE LITERATURE
- System dynamics (SD) modeling and group model-building (GMB) approaches enable a group to start with a simple concept model and apply its collective knowledge to finish the session with a much more developed model that can produce impressively accurate simulation results.
- The model will be used to understand the impact of Connecticut's Good Samaritan Laws (GSLs), as well as their limitations, and to deduce factors that could further improve public health laws countering opioid overdose-related deaths.
- The approach can be applied to other jurisdictions, taking into account local conditions and existing Good Samaritan legislation.
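At its core, an SD model of the kind described here is a set of stocks updated by flows over time. The toy below is only a structural sketch; the stock names, rates, and the single "intervention effectiveness" parameter standing in for GSL impact are all invented for illustration, not taken from the Connecticut model.

```python
def simulate_overdose_sd(steps=12, dt=1.0):
    """Toy stock-and-flow simulation.

    One stock of people at risk is drained by fatal overdoses; a
    bystander-intervention parameter (a stand-in for GSL effectiveness)
    reduces the fraction of overdoses that become fatal. Returns the
    cumulative-deaths trajectory.
    """
    at_risk, deaths = 1000.0, 0.0          # illustrative initial stocks
    overdose_rate = 0.02                   # overdoses per person per step
    intervention_effect = 0.6              # fraction of overdoses made non-fatal
    history = []
    for _ in range(steps):
        overdoses = at_risk * overdose_rate * dt
        fatal = overdoses * (1.0 - intervention_effect)
        at_risk -= fatal                   # flow out of the at-risk stock
        deaths += fatal                    # flow into the cumulative-deaths stock
        history.append(deaths)
    return history

print(simulate_overdose_sd(steps=3))
```

Policy simulation in this framing means re-running the loop with a different `intervention_effect` and comparing trajectories, which is the shape of the calibration-and-simulation step the protocol describes.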

8
Machine Learning Applied to Clinical Laboratory Data Predicts Patient-Specific, Near-Term Relapse in Patients in Medication for Opioid Use Disorder Treatment

Pyzowski, P.; Herbert, B.; Malik, W. Q.

2020-08-14 addiction medicine 10.1101/2020.08.10.20163881 medRxiv
Top 0.1%
12.7%

We have developed a data-driven, algorithmic method for identifying patients in an outpatient buprenorphine program who are at high risk of relapse in the following seven days. The method uses data already available from clinical laboratory testing, can be made available in a timely manner, and is easily understandable and actionable by clinicians. Its use could significantly reduce the rate of relapse in addiction treatment programs by targeting interventions at those patients most at risk of near-term relapse.

9
Machine Learning Applications and Advancements in Alcohol Use Disorder: A Systematic Review

Hurtado, M.; Siefkas, A.; Attwood, M. M.; Iqbal, Z.; Hoffman, J.

2022-06-07 addiction medicine 10.1101/2022.06.06.22276057 medRxiv
Top 0.1%
12.6%

Background: Alcohol use disorder (AUD) is a chronic mental disorder that leads to harmful, compulsive drinking patterns that can have serious consequences. Advancements are needed to overcome current barriers in the diagnosis and treatment of AUD. Objectives: This comprehensive review analyzes research efforts that apply machine learning (ML) methods to AUD prediction, diagnosis, treatment, and health outcomes. Methods: A systematic literature review was conducted. A search performed on 12/02/2020 for published articles indexed in Embase and PubMed Central with AUD- and ML-related terms retrieved 1,628 articles. We identified those that used ML-based techniques to diagnose AUD or make predictions concerning AUD or AUD-related outcomes. Studies were excluded if they were animal research, did not diagnose or make predictions for AUD or AUD-related outcomes, were published in a non-English language, only used conventional statistical methods, or were not research articles. Results: After full screening, 70 articles were included in our review. Algorithms developed for AUD predictions utilize a wide variety of data sources including electronic health records, genetic information, neuroimaging, social media, and psychometric data. Sixty-six of the included studies displayed a high or moderate risk of bias, largely due to a lack of external validation in algorithm development and missing data. Conclusions: There is strong evidence that ML-based methods have the potential to make accurate predictions for AUD, due to their ability to model relationships between variables and reveal trends in data. The application of ML may help address the current underdiagnosis of AUD and support those in recovery from AUD.

10
A Semi-Supervised Contrastive Learning Approach to Alzheimer's Disease Diagnostics using Convolutional Autoencoders

Jung, E. W.; Kashyap, A.; Hsu, B.; Moreland, M.; Chantaduly, C.; Chang, P.

2022-12-30 radiology and imaging 10.1101/2022.12.27.22283984 medRxiv
Top 0.1%
12.3%

PURPOSE: Alzheimer's Disease (AD) is a neurodegenerative disease that progressively deteriorates memory and cognitive abilities. PET 18F-AV45 (florbetapir) is a common imaging modality used to characterize the distribution of beta-amyloid deposits in the brain; however, interpretation may be subjective, and the misdiagnosis rate of AD ranges from 12-23%. Automated algorithms for PET 18F-AV45 interpretation, including those derived from deep learning, may facilitate more objective and accurate AD diagnosis. MATERIALS & METHODS: A total of 1232 PET AV45 scans (207 AD; 1025 normal) were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI). A semi-supervised deep learning framework was developed to differentiate AD and normal patients. The framework consists of an autoencoder (AE), a contrastive learning loss, and a categorical classification head. A contrastive learning paradigm is used to improve the discriminative properties of latent feature vectors in multidimensional space. RESULTS: Upon five-fold cross-validation, the best-performing semi-supervised contrastive model achieved a validation accuracy of 82% to 86%. Secondary analysis included visualization of intermediate activations, classification report verification, and principal component analysis (PCA) of latent feature vectors. The training process yielded optimal converging losses for all three loss frameworks. CONCLUSION: A deep learning model can accurately diagnose AD using PET 18F-AV45 scans. Such models require large amounts of labeled data during training. The use of a semi-supervised contrastive learning objective and an AE regularizer helps to improve model performance, especially when dataset sizes are constrained. Latent representations extracted by the model cluster strongly with the addition of a contrastive learning mechanism.
Summary Statement: A semi-supervised contrastive deep learning system optimizes latent feature vector representations and yields strong classification performance for larger data distributions within the Alzheimer's Disease diagnostics domain. Key Points
- A common diagnostic procedure used by trained radiologists in the clinical setting is the visual analysis of PET 18F-AV45 neuroimaging scans to diagnose the different stages of Alzheimer's Disease in a patient.
- Contrastive learning is a strategy for optimizing latent feature representations in multidimensional space through a loss function that maximizes the distance between feature vectors of different classes and minimizes the distance between feature vectors of the same class.
- A semi-supervised contrastive learning approach can improve the performance and generalization of deep learning models optimized on small training datasets, as encountered in Alzheimer's Disease and other neurodegenerative conditions.
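The contrastive objective described in the key points (pull same-class latent vectors together, push different-class vectors beyond a margin) can be sketched without any deep-learning framework. A margin-based pairwise loss is one common form, used here as an illustrative stand-in for the paper's exact loss; the embeddings are toy 2-D points, not PET-derived latents.

```python
import math

def pairwise_contrastive_loss(embeddings, labels, margin=1.0):
    """Margin-based contrastive loss averaged over all pairs.

    Same-class pairs contribute their squared distance (pulled together);
    different-class pairs contribute max(0, margin - distance)^2
    (pushed apart until they clear the margin).
    """
    total, n_pairs = 0.0, 0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            d = math.dist(embeddings[i], embeddings[j])
            if labels[i] == labels[j]:
                total += d ** 2
            else:
                total += max(0.0, margin - d) ** 2
            n_pairs += 1
    return total / n_pairs

# Toy latent vectors: two AD-like points near each other, one normal point far away
print(pairwise_contrastive_loss([(0.0, 0.0), (0.1, 0.0), (2.0, 0.0)], [1, 1, 0]))
```

Minimizing this quantity over the encoder's parameters is what produces the strongly clustered latent representations reported above.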

11
Analysis of Feature Influence on Covid-19 Death Rate Per Country Using a Novel Orthogonalization Technique

Gonnet, G.; Stewart, J.; Lafleur, J.; Keith, S.; McLellan, M.; Jiang-Gorsline, D.; Snider, T.

2021-07-05 health informatics 10.1101/2021.07.02.21259929 medRxiv
Top 0.1%
10.2%

We have developed a new Feature Importance technique, a topic of machine learning, to analyze the possible causes of the Covid-19 pandemic based on country data. This new approach works well even when there are many more features than countries and is not affected by high correlation among features. It is inspired by the Gram-Schmidt orthogonalization procedure from linear algebra. We study the number of deaths, which is more reliable than the number of cases at the onset of the pandemic, during Apr/May 2020. This period is when countries were only starting to take measures, so more light is shed on the root causes of the pandemic than on its handling. The analysis is done against a comprehensive list of roughly 3,200 features. We find that globalization is the main contributing cause, followed by calcium intake, economic factors, environmental factors, preventative measures, and others. The analysis was repeated for 20 different dates and shows that some factors, like calcium, phase in or out over time. We also compute row explainability, i.e. for every country, how much each feature explains its death rate. Finally, we study a series of conditions, e.g. comorbidities and immunization, that have been proposed to explain the pandemic and place them in their proper context. While there are many caveats to this analysis, we believe it sheds light on the possible causes of the Covid-19 pandemic. One-Sentence Summary: We use a novel feature importance technique to find that globalization, followed by calcium intake, economic factors, environmental factors, and some aspects of societal quality, are the main country-level factors that explain early Covid-19 death rates.
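The Gram-Schmidt idea, orthogonalizing each candidate feature against the features already selected before scoring it, can be sketched as a greedy loop. This toy is an assumption-laden illustration of the general principle, not the authors' exact procedure, and it handles correlated features by construction: a feature collinear with an already-selected one has a near-zero residual and scores near zero.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_out(v, basis):
    """Remove from v its components along each orthonormal basis vector."""
    for b in basis:
        coef = dot(v, b)
        v = [vi - coef * bi for vi, bi in zip(v, b)]
    return v

def normalize(v):
    n = dot(v, v) ** 0.5
    return [vi / n for vi in v] if n > 1e-12 else v

def gram_schmidt_importance(features, target, k=2):
    """Greedy Gram-Schmidt-style feature ranking.

    At each step, orthogonalize every remaining feature against the
    selected basis, score it by |dot(residual, target)|, and keep the
    best; returns feature names in order of selection.
    """
    basis, order = [], []
    for _ in range(k):
        best, best_score, best_dir = None, -1.0, None
        for name, col in features.items():
            if name in order:
                continue
            residual = normalize(project_out(list(col), basis))
            score = abs(dot(residual, target))
            if score > best_score:
                best, best_score, best_dir = name, score, residual
        order.append(best)
        basis.append(best_dir)
    return order

feats = {"a": [1.0, 0.0, 0.0, 1.0], "b": [0.0, 1.0, 1.0, 0.0]}
print(gram_schmidt_importance(feats, [1.0, 0.0, 0.0, 1.0], k=2))
```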

12
Deep Learning Improves Parameter Estimation in Reinforcement Learning Models

Xiong, H.-D.; Ji-An, L.; Mattar, M. G.; Wilson, R. C.

2025-03-24 animal behavior and cognition 10.1101/2025.03.21.644663 medRxiv
Top 0.1%
10.2%

Abstract: Cognitive models are widely used in psychology and neuroscience to formulate and test hypotheses about cognitive processes. These processes are characterized by model parameters, which are then used for scientific inference. The reliability of scientific conclusions from cognitive modeling depends critically on the reliability of parameter estimation, yet estimating parameters remains a universal challenge, particularly when data are too limited to constrain them. In such cases, multiple sets of parameters may explain the experimental data equally well within the same model, raising the question of which parameters are scientifically meaningful. We refer to this problem as parameter ambiguity. In this paper, we investigate parameter ambiguity in reinforcement learning under two optimization methods. We employ the de facto standard Nelder-Mead method (fminsearch) and a neural network trained to estimate parameters using a modern deep learning pipeline, which has seen limited application in cognitive modeling. Across ten decision-making datasets, we consistently find that the two methods produce substantially different parameter estimates despite achieving nearly identical fitting performance. To address this ambiguity, we introduce a systematic evaluation framework that goes beyond predictive accuracy to assess generalizability, robustness, identifiability, and test-retest reliability, thereby offering principled guidance on which parameter estimates should inform scientific inference. Applying this framework reveals that the neural network with a deep learning pipeline outperforms across these metrics. Our study establishes parameter ambiguity as an underappreciated challenge with significant implications for scientific replicability, highlighting that the choice of optimization method is a critical factor shaping scientific conclusions.
We advocate for our multi-faceted evaluation approach to ensure reliable scientific inference and for broader integration of modern deep learning pipelines into cognitive modeling.
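A minimal example of the kind of cognitive model being fitted here: the log-likelihood of choices under a two-armed Q-learning model with learning rate alpha and inverse temperature beta. The choice/reward sequence below is a toy; on short sequences like this, quite different (alpha, beta) pairs can yield nearly identical log-likelihoods, which is exactly the parameter ambiguity the abstract describes.

```python
import math

def q_learning_loglik(alpha, beta, choices, rewards):
    """Log-likelihood of binary choices (0/1 = arm index) under a
    two-armed Q-learning model: softmax choice rule, delta-rule update."""
    q = [0.0, 0.0]
    ll = 0.0
    for c, r in zip(choices, rewards):
        p_arm1 = 1.0 / (1.0 + math.exp(-beta * (q[1] - q[0])))
        p = p_arm1 if c == 1 else 1.0 - p_arm1
        ll += math.log(max(p, 1e-12))      # guard against log(0)
        q[c] += alpha * (r - q[c])         # delta-rule update for chosen arm
    return ll

print(q_learning_loglik(0.5, 2.0, [1, 0, 1], [1.0, 0.0, 1.0]))
```

Fitting means maximizing this function over (alpha, beta), whether by Nelder-Mead or by an amortized neural estimator; the evaluation framework above then asks which of the near-equally-likely estimates deserves scientific trust.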

13
Identification of Suicide-Related Subgroups Using Latent Class Analysis: Complementary Insights to Explainable AI-Based Classification

Kizilaslan, B.; Mehlum, L.

2026-03-27 psychiatry and clinical psychology 10.64898/2026.03.25.26349264 medRxiv
Top 0.1%
10.1%

Purpose: Suicide and self-harm are major public health concerns characterized by substantial clinical and psychosocial heterogeneity. While latent class analysis has been used to identify subgroups of people with suicidal behavior, the extent to which such population-level phenotyping complements explainable artificial intelligence-based classification models remains unclear. Methods: We applied latent class analysis to a cross-sectional, publicly available dataset of 1000 individuals presenting with self-harm and suicide-related behaviors at Colombo South Teaching Hospital, Kalubowila, Sri Lanka. Sociodemographic, psychosocial, and clinical variables were used to identify latent subgroups. Class characteristics and suicide prevalence were examined and compared with variable importance patterns reported in a previously published explainable artificial intelligence (XAI)-based suicide classification study using the same dataset. Results: Four latent classes were identified. Two classes exhibited very high suicide prevalence (91.2% [95% CI: 87.7-93.8] and 99.0% [95% CI: 96.4-99.7]), whereas two classes showed low prevalence (<1%). The two high-prevalence classes differed markedly in lifetime psychiatric hospitalization history, with one class showing a 100% prevalence of prior hospitalization and the other substantially lower hospitalization rates. These patterns partially aligned with, and extended beyond, variable importance findings from the XAI-based model. Conclusion: Latent class analysis identified distinct subgroups with substantially different suicide prevalence and clinical profiles, underscoring the heterogeneity of individuals presenting with self-harm. Comparison with the XAI-based suicide classification model's findings suggests that unsupervised phenotyping and supervised classification provide complementary perspectives, offering population-level context that may enhance the interpretability of suicide assessment frameworks.
Keywords: suicide; self-harm; latent class analysis; explainable artificial intelligence; machine learning
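The unsupervised phenotyping step described above can be illustrated with a minimal latent class analysis fit by expectation-maximization. This is a sketch only: the binary indicators, two-class setting, and data below are hypothetical, not the study's variables or implementation.

```python
import math
import random

def fit_lca(data, n_classes=2, n_iter=200, seed=0):
    """Minimal EM for latent class analysis on binary indicators.

    data: rows of 0/1 indicator values. Returns (class_weights, item_probs),
    where item_probs[k][j] approximates P(indicator j = 1 | class k).
    """
    rng = random.Random(seed)
    n, m = len(data), len(data[0])
    weights = [1.0 / n_classes] * n_classes
    probs = [[rng.uniform(0.25, 0.75) for _ in range(m)] for _ in range(n_classes)]
    for _ in range(n_iter):
        # E-step: posterior responsibility of each class for each row
        resp = []
        for row in data:
            logs = []
            for k in range(n_classes):
                ll = math.log(max(weights[k], 1e-12))
                for j, x in enumerate(row):
                    p = min(max(probs[k][j], 1e-9), 1 - 1e-9)
                    ll += math.log(p if x else 1 - p)
                logs.append(ll)
            mx = max(logs)
            e = [math.exp(l - mx) for l in logs]
            s = sum(e)
            resp.append([v / s for v in e])
        # M-step: re-estimate class weights and item probabilities
        for k in range(n_classes):
            rk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = rk / n
            for j in range(m):
                probs[k][j] = sum(r[k] * row[j] for r, row in zip(resp, data)) / rk
    return weights, probs

# Hypothetical sample: two clearly separated response patterns
sample = [[1, 1, 0]] * 40 + [[0, 0, 1]] * 60
weights, item_probs = fit_lca(sample)
```

In practice the number of classes is chosen by comparing fit criteria (e.g. BIC) across candidate models, as dedicated LCA software does.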

14
Beyond Accuracy: Investigating the Potential Clinical Utility of Predicting Functional Dependency and Severe Disability or Death in Successfully Reperfused Patients using Machine Learning

Meier, R.; Burri, M.; Fischer, S.; McKinley, R.; Jung, S.; Meinel, T.; Fischer, U.; Piechowiak, E. I.; Mordasini, P.; Gralla, J.; Wiest, R.; Kaesmacher, J.

2020-11-18 neurology 10.1101/2020.11.17.20232280 medRxiv
Top 0.1%
10.0%

Objectives: Machine learning (ML) has been demonstrated to improve the prediction of functional outcome in patients with acute ischemic stroke. However, its value in a specific clinical use case has not been investigated. The aim of this study was to assess the clinical utility of ML models with respect to predicting functional impairment and severe disability or death, considering their potential value as a decision-support tool in an acute stroke workflow. Materials and Methods: Patients (n=1317) from a retrospective, non-randomized observational registry treated with mechanical thrombectomy (MT) were included. The final dataset of patients who underwent successful recanalization (TICI ≥2b) (n=932) was split to develop ML-based prediction models on data from 745 (80%) patients. Subsequently, the models were tested on the remaining 187 (20%) patients. For comparison, baseline algorithms using majority class prediction, the SPAN-100 score, the PRE score, and the Stroke-TPI score were implemented. The ML methods included eight different algorithms (e.g. Support Vector Machines and Random Forests), a stacked ensemble method, and tabular neural networks. Prediction of modified Rankin Scale (mRS) 3-6 (primary analysis) and mRS 5-6 (secondary analysis) at 3 months was performed using 25 baseline variables available at patient admission. ML models were assessed with respect to their ability for discrimination, calibration, and clinical utility (decision curve analysis). Results: Analyzed patients (n=932) had a median age of 74.7 (IQR 62.7-82.4) years, with 461 (49.5%) being female.
ML methods performed better than clinical scores, with the stacked ensemble method providing the best overall performance, including an F1-score of 0.75 ± 0.01, an ROC-AUC of 0.81 ± 0.00, an AP score of 0.81 ± 0.01, an MCC of 0.48 ± 0.02, and an ECE of 0.06 ± 0.01 for prediction of mRS 3-6, and an F1-score of 0.57 ± 0.02, an ROC-AUC of 0.79 ± 0.01, an AP score of 0.54 ± 0.02, an MCC of 0.39 ± 0.03, and an ECE of 0.19 ± 0.01 for prediction of mRS 5-6. Decision curve analyses suggested the highest mean net benefit of 0.09 ± 0.02 at the a priori defined threshold (0.8) for the stacked ensemble method in the primary analysis (mRS 3-6). Across all methods, higher mean net benefits were achieved for optimized probability thresholds, but with considerably reduced certainty (threshold probabilities 0.24-0.47). For the secondary analysis (mRS 5-6), none of the ML models achieved a positive net benefit at the a priori threshold probability of 0.8. Conclusions: The clinical utility of ML prediction models in a decision-support scenario aimed at yielding high certainty for prediction of functional dependency (mRS 3-6) is marginal, and is not evident for the prediction of severe disability or death (mRS 5-6). Hence, using those models for patient exclusion cannot be recommended, and future research should evaluate utility gains after incorporating more advanced imaging parameters.
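The net benefit quoted above follows the standard decision-curve formula, net benefit = TP/n − (FP/n)·pt/(1 − pt), where pt is the threshold probability. A minimal sketch with made-up outcomes and predictions:

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold.

    y_true: list of 0/1 outcomes; y_prob: predicted probabilities;
    threshold: the threshold probability pt of decision curve analysis.
    """
    n = len(y_true)
    treated = [(y, p) for y, p in zip(y_true, y_prob) if p >= threshold]
    tp = sum(1 for y, _ in treated if y == 1)  # true positives among treated
    fp = len(treated) - tp                     # false positives among treated
    return tp / n - (fp / n) * threshold / (1 - threshold)

# Hypothetical cohort: two true positives and one false positive flagged
y = [1, 1, 0, 0, 1, 0]
p = [0.9, 0.85, 0.9, 0.2, 0.3, 0.1]
nb_high = net_benefit(y, p, 0.8)  # strict threshold, heavy false-positive penalty
nb_mid = net_benefit(y, p, 0.5)   # looser threshold, milder penalty
```

Note how the pt/(1 − pt) weighting makes false positives very costly at high thresholds, which is why a model can show positive net benefit at moderate thresholds yet none at pt = 0.8.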

15
Effectiveness, Explainability and Reliability of Machine Meta-Learning Methods for Predicting Mortality in Patients with COVID-19: Results of the Brazilian COVID-19 Registry

Miranda de Paiva, B. B.; Pereira, P. D.; de Andrade, C. M. V.; Gomes, V. M. R.; Lima, M. C. P. B.; Silva, M. V. R. S.; Carneiro, M.; Martins, K. P. M. P.; Sales, T. L. S.; Carvalho, R. L. R. d.; Pires, M. C.; Ramos, L. E. F.; Silva, R. T.; Bezerra, A. F. B.; Schwarzbold, A. V.; Nunes, A. G. S.; Maurilio, A. d. O.; Scotton, A. L. B. A.; Costa, A. S. d. M.; Castro, A. A.; Farace, B. L.; Cimini, C. C. R.; De Carvalho, C. A.; Silveira, D. V.; Ponce, D.; Pereira, E. C.; Manenti, E. R. F.; Cenci, E. P. d. A.; Lucas, F. B.; Rodrigues, F. D.; Anschau, F.; Botoni, F. A.; Aranha, F. G.; Bartolazzi, F.;

2021-11-02 health informatics 10.1101/2021.11.01.21265527 medRxiv
Top 0.1%
10.0%

Objective: To provide a thorough comparative study of state-of-the-art machine learning methods and statistical methods for determining in-hospital mortality in COVID-19 patients using data available upon hospital admission; to study the reliability of the predictions of the most effective methods by correlating the probability of the outcome with the accuracy of the methods; and to investigate how explainable the predictions produced by the most effective methods are. Materials and Methods: De-identified data were obtained from COVID-19-positive patients in 36 participating hospitals from March 1 to September 30, 2020. Demographic, comorbidity, clinical presentation, and laboratory data were used as training data to develop COVID-19 mortality prediction models. Multiple machine learning and traditional statistical models were trained on this prediction task using a folded cross-validation procedure, from which we assessed performance and interpretability metrics. Results: Stacking of machine learning models improved over the previous state-of-the-art results by more than 26% in predicting the class of interest (death), achieving an AUROC of 87.1% and a macro F1 of 73.9%. We also show that some machine learning models can be very interpretable and reliable, yielding more accurate predictions while providing a good explanation of why. Conclusion: The best results were obtained using the meta-learning ensemble model, Stacking. State-of-the-art explainability techniques such as SHAP values can be used to draw useful insights into the patterns learned by machine learning algorithms. Machine learning models can be more explainable than traditional statistical models while also yielding highly reliable predictions.
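The stacking idea used here — a meta-learner trained on out-of-fold predictions of base models — can be sketched in a few dozen lines. The base learners, meta-learner, and data below are toy stand-ins (one mean-threshold classifier per feature, a hand-rolled logistic regression), not the registry's actual models:

```python
import math
import random

def base_fit(X, y, j):
    """Toy base learner: mean-threshold classifier on feature column j."""
    thr = sum(row[j] for row in X) / len(X)
    return lambda x: 1.0 if x[j] > thr else 0.0

def fit_stacking(X, y, n_folds=5, seed=0):
    """Stacking: out-of-fold base predictions become the meta-learner's inputs."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [set(idx[i::n_folds]) for i in range(n_folds)]
    n_base = len(X[0])  # one toy base learner per feature
    meta_X = [[0.0] * n_base for _ in X]
    for fold in folds:
        train = [i for i in idx if i not in fold]
        models = [base_fit([X[i] for i in train], [y[i] for i in train], j)
                  for j in range(n_base)]
        for i in fold:  # predictions only from models that never saw row i
            meta_X[i] = [m(X[i]) for m in models]
    # Meta-learner: logistic regression fit by plain stochastic gradient descent
    w, b = [0.0] * n_base, 0.0
    for _ in range(300):
        for feats, t in zip(meta_X, y):
            p = 1 / (1 + math.exp(-(b + sum(wi * f for wi, f in zip(w, feats)))))
            g = p - t
            b -= 0.1 * g
            w = [wi - 0.1 * g * f for wi, f in zip(w, feats)]
    final = [base_fit(X, y, j) for j in range(n_base)]  # refit on all data
    def predict(x):
        z = b + sum(wi * m(x) for wi, m in zip(w, final))
        return 1 / (1 + math.exp(-z))
    return predict

# Hypothetical data: the label depends only on the first feature
rng = random.Random(1)
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [1 if row[0] > 0.5 else 0 for row in X]
model = fit_stacking(X, y)
```

The out-of-fold step is the essential part: it keeps the meta-learner from simply memorizing base-model overfitting.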

16
Multimodal BEHRT: Transformers for Multimodal Electronic Health Records to predict breast cancer prognosis

MBAYE, N. M.; Danziger, M.; Toussaint, A.; Dumas, E.; Guerin, J.; Hamy-Petit, A.-S.; Reyal, F.; Rosen-Zvi, M.; AZENCOTT, C.-A.

2024-09-23 health informatics 10.1101/2024.09.18.24312984 medRxiv
Top 0.1%
10.0%

Background: Breast cancer is a complex disease that affects millions of people and is the leading cause of cancer death worldwide. There is therefore still a need to develop new tools to improve treatment outcomes for breast cancer patients. Electronic Health Records (EHRs) contain a wealth of information about patients, from pathology reports to biological measurements, that could be useful towards this end but remains mostly unexploited. Recent methodological developments in deep learning, however, open the way to new methods that leverage this information to improve patient care. Methods: In this study, we propose M-BEHRT, a Multimodal BERT for Electronic Health Record (EHR) data based on BEHRT, itself an architecture based on the popular natural language architecture BERT (Bidirectional Encoder Representations from Transformers). M-BEHRT models multimodal patient trajectories as a sequence of medical visits, which comprise a variety of information ranging from clinical features, results from biological lab tests, and medical department and procedure, to the content of free-text medical reports. M-BEHRT uses a pretraining task analogous to a masked language model to learn a representation of patient trajectories from data that includes records left unlabeled due to censoring, and is then fine-tuned to the classification task at hand. Finally, we used a gradient-based attribution method to highlight which parts of the input patient trajectory were most relevant for the prediction. Results: We apply M-BEHRT to a retrospective cohort of about 15,000 breast cancer patients from Institut Curie (Paris, France) treated with adjuvant chemotherapy, using patient trajectories for up to one year after surgery to predict disease-free survival (DFS).
M-BEHRT achieves an AUC-ROC of 0.77 [0.70-0.84] on a held-out data set for the prediction of DFS 3 years after surgery, compared to 0.67 [0.58-0.75] for both the Nottingham Prognostic Index (NPI) and a random forest (p-values = 0.031 and 0.050, respectively). In addition, we identified subsets of patients for which M-BEHRT performs particularly well, such as older patients with at least one lymph node affected. Conclusion: We proposed a novel deep learning algorithm to learn from multimodal EHR data. Learning from about 15,000 patient records, our model achieves state-of-the-art performance on two classification tasks. The EHR data used for these tasks were more homogeneous than other datasets used for pretraining, as they exclusively comprised adjuvant-treated breast cancer patients. This highlights both the potential of EHR data for improving our understanding of breast cancer and the ability of transformer-based architectures to learn from EHR data containing far fewer than the millions of records typically used in currently published studies. The representation of patient trajectories used by M-BEHRT captures their sequential aspect and opens new research avenues for understanding complex diseases and improving patient care.
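The masked pretraining task described above hinges on hiding a fraction of trajectory tokens and training the model to recover them. A sketch of just the masking step follows; the token names and the 15% masking rate are illustrative assumptions, not M-BEHRT's actual configuration:

```python
import random

MASK = "[MASK]"

def mask_trajectory(tokens, mask_rate=0.15, seed=0):
    """Randomly mask a fraction of visit tokens; return (inputs, labels).

    labels[i] holds the original token at masked positions and None elsewhere,
    so the pretraining loss is computed only where masking occurred.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels

# Hypothetical patient trajectory: coded visits mixing labs and procedures
traj = ["visit:surgery", "lab:CA15-3", "dept:oncology", "proc:chemo", "lab:CBC"]
inputs, labels = mask_trajectory(traj)
```

Because the objective needs no outcome label, censored patients can contribute to pretraining, which is the point made in the abstract.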

17
Synthesising artificial patient-level data for Open Science - an evaluation of five methods

Allen, M.; Salmon, A.

2020-10-13 health informatics 10.1101/2020.10.09.20210138 medRxiv
Top 0.1%
9.8%

Background: Open science is a movement seeking to make scientific research accessible to all, including publication of code and data. Publishing patient-level data may, however, compromise the confidentiality of that data if there is any significant risk that data may later be associated with individuals. Use of synthetic data offers the potential to release data that may be used to evaluate methods or perform preliminary research without risk to patient confidentiality. Methods: We have tested five synthetic data methods:
- A technique based on Principal Component Analysis (PCA), which samples data from distributions derived from the transformed data.
- Synthetic Minority Oversampling Technique (SMOTE), which is based on interpolation between near neighbours.
- Generative Adversarial Network (GAN), an artificial neural network approach with competing networks: a discriminator network trained to distinguish between synthetic and real data, and a generator network trained to produce data that can fool the discriminator network.
- CT-GAN, a refinement of GANs specifically for the production of structured tabular synthetic data.
- Variational Auto Encoder (VAE), a method of encoding data in a reduced number of dimensions and sampling from distributions based on the encoded dimensions.
Two data sets were used to evaluate the methods:
- The Wisconsin Breast Cancer data set, a histology data set where all features are continuous variables.
- A stroke thrombolysis pathway data set, describing characteristics of patients for whom a decision is made whether to treat with clot-busting medication; features are mostly categorical, binary, or integer.
Methods were evaluated in three ways:
- The ability of synthetic data to train a logistic regression classification model.
- A comparison of means and standard deviations between original and synthetic data.
- A comparison of covariance between features in the original and synthetic data.
Results: Using the Wisconsin Breast Cancer data set, the original data gave 98% accuracy in a logistic regression classification model. Synthetic data sets gave between 93% and 99% accuracy. Performance (best to worst) was SMOTE > PCA > GAN > CT-GAN = VAE. All methods reproduced the original data means and standard deviations with high accuracy (R-squared > 0.96 for all methods and data classes). CT-GAN and VAE suffered a significant loss of covariance between features in the synthetic data sets. Using the stroke pathway data set, the original data gave 82% accuracy in a logistic regression classification model. Synthetic data sets gave between 66% and 82% accuracy. Performance (best to worst) was SMOTE > PCA > CT-GAN > GAN > VAE. CT-GAN and VAE suffered loss of covariance between features in the synthetic data sets, though less pronounced than with the Wisconsin Breast Cancer data set. Conclusions: The pilot work described here shows, as proof of concept, that synthetic data of sufficient quality may be produced to publish alongside open methodology, allowing people to better understand and test that methodology. The quality of the synthetic data also holds promise for data sets that may be used for screening of ideas or for research projects (perhaps especially in an educational setting). More work is required to further refine and test the methods across a broader range of patient-level data sets.
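Of the five methods, SMOTE is the easiest to sketch: each synthetic record interpolates between a real record and one of its nearest neighbours. A toy stdlib-only version using Euclidean distance (not the paper's implementation, which used established library code):

```python
import math
import random

def smote(X, n_synthetic, k=3, seed=0):
    """Generate synthetic rows by interpolating between near neighbours."""
    rng = random.Random(seed)
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    out = []
    for _ in range(n_synthetic):
        x = rng.choice(X)
        # k nearest neighbours of x (excluding x itself)
        nbrs = sorted((r for r in X if r is not x), key=lambda r: dist(x, r))[:k]
        nb = rng.choice(nbrs)
        t = rng.random()  # interpolation fraction along the segment x -> nb
        out.append([xi + t * (ni - xi) for xi, ni in zip(x, nb)])
    return out

# Hypothetical data: 20 two-feature records
rng = random.Random(42)
real = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(20)]
synth = smote(real, 50)
```

Interpolation explains why SMOTE preserved means and covariance so well here: every synthetic value lies between two real values, so it cannot stray outside the observed range, though by the same token it cannot create genuinely novel patterns.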

18
Replication of an open-access deep learning system for screening mammography: Reduced performance mitigated by retraining on local data

Condon, J. J. J.; Oakden-Rayner, L.; Hall, K. A.; Reintals, M.; Holmes, A.; Carneiro, G.; Palmer, L. J.

2021-06-01 radiology and imaging 10.1101/2021.05.28.21257892 medRxiv
Top 0.1%
8.7%

Aim: To assess the generalisability of a deep learning (DL) system for screening mammography, developed at New York University (NYU), USA (1, 2), in a South Australian (SA) dataset. Methods and Materials: Clients with pathology-proven lesions (n=3,160) and age-matched controls (n=3,240) were selected from women screened at BreastScreen SA from January 2010 to December 2016 (n clients=207,691) and split into training, validation and test subsets (70%, 15%, and 15% respectively). The primary outcome was area under the curve (AUC), in SA Test Set 1 (SATS1), differentiating invasive breast cancer or ductal carcinoma in situ (n=469) from age-matched controls (n=490) and benign lesions (n=44). The NYU system was tested statically, after training without transfer learning (TL), and after retraining with TL, in variants without (NYU1) and with (NYU2) heatmaps. Results: The static NYU1 model AUCs in the NYU test set (NYTS) and SATS1 were 83.0% (95% CI 82.4%-83.6%) (2) and 75.8% (95% CI 72.6%-78.8%), respectively. Static NYU2 AUCs in the NYTS and SATS1 were 88.6% (95% CI 88.3%-88.9%) (2) and 84.5% (95% CI 81.9%-86.8%), respectively. Training NYU1 and NYU2 without TL achieved AUCs in the SATS1 of 65.8% (95% CI 62.2%-69.1%) and 85.9% (95% CI 83.5%-88.2%), respectively. Retraining NYU1 and NYU2 with TL resulted in AUCs of 82.4% (95% CI 79.7%-84.9%) and 86.3% (95% CI 84.0%-88.5%), respectively. Conclusion: We did not fully reproduce the reported performance of the NYU system on a local dataset; local retraining with TL approximated this level of performance. Optimising models for local clinical environments may improve performance, and the generalisation of DL systems to new environments may be challenging. Key Contributions: In this study, the original performance of deep learning models for screening mammography was reduced in an independent clinical population. Deep learning (DL) systems for mammography require local testing and may benefit from local retraining. An openly available DL system approximates human performance in an independent dataset. There are multiple potential sources of reduced deep learning system performance when deployed to a new dataset and population.
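AUC point estimates with confidence intervals, as quoted throughout this abstract, can in principle be computed with a rank-based AUC estimator plus a percentile bootstrap. A self-contained sketch on hypothetical scores (the study's own CI method may differ):

```python
import random

def auc(y_true, scores):
    """AUC via the Mann-Whitney U statistic (ties counted as 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(y_true, scores, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y_true[i] for i in idx]
        if 0 < sum(yb) < n:  # resample must contain both classes
            stats.append(auc(yb, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical scores on a small test set
y = [0, 0, 1, 1, 0, 1, 1, 0]
s = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3]
point = auc(y, s)
ci = bootstrap_auc_ci(y, s, n_boot=200)
```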

19
Classification of Adolescent Drinking via Behavioral, Biological, and Environmental Features: A Machine Learning Approach with Bias Control

Liu, R.; Azzam, M.; Zabik, N.; Wan, S.; Blackford, J.; Wang, J.

2026-02-26 addiction medicine 10.64898/2026.02.24.26347002 medRxiv
Top 0.1%
8.7%

In 2024, approximately 30% of U.S. adolescents reported having consumed alcohol at least once in their lifetime, with about 25% of these individuals engaging in binge drinking. Adolescent alcohol use is associated with neurodevelopmental impairments, elevated risk of later alcohol use, and mental health disorders. These findings underscore the importance of identifying the variables driving adolescent alcohol use and leveraging them for early identification and targeted intervention. Previous studies have typically developed machine-learning classification models that use neuroimaging data in combination with limited clinical measurements. Neuroimaging data are expensive and difficult to obtain at scale, whereas clinical measures are more practical for large-scale screening due to their low cost and widespread accessibility. However, clinical-only approaches to alcohol drinking classification remain largely underexplored. Furthermore, prior studies have often focused on adults, limiting generalizability to the broader adolescent population. Additionally, confounding factors such as age and substance use, which are strongly correlated with alcohol consumption, have often been inadequately addressed, potentially inflating classification performance. Finally, class imbalance remains a persistent challenge, with prior attempts yielding only limited improvements. To address these limitations, we propose FocalTab, a framework that integrates TabPFN with focal loss for robust generalization and effective mitigation of class imbalance. The approach also incorporates an initial preprocessing step that removes the confounding effects of age and substance use. We compare FocalTab against state-of-the-art methods across different variable selections and dataset settings.
FocalTab achieves the highest accuracy (84.3%) and specificity (80.0%) in the most stringent setting, in which both age and substance use variables were excluded, whereas competing models drop to near-chance specificity (12-24%). We further applied SHapley Additive exPlanations (SHAP) analysis to identify key clinical predictors of drinker classification, supporting enhanced screening and early intervention.
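The focal-loss component of FocalTab down-weights well-classified examples by a modulating factor (1 − pt)^γ on top of cross-entropy, which is how it counteracts class imbalance. A minimal binary version (the γ value and example probabilities below are illustrative, not the paper's settings):

```python
import math

def focal_loss(p, y, gamma=2.0, eps=1e-12):
    """Binary focal loss for a single prediction.

    p: predicted probability of class 1; y: true label (0 or 1).
    With gamma=0 this reduces to ordinary cross-entropy.
    """
    pt = p if y == 1 else 1 - p       # probability assigned to the true class
    pt = min(max(pt, eps), 1 - eps)   # guard against log(0)
    return -((1 - pt) ** gamma) * math.log(pt)

# An easy, correct example contributes far less to the loss than a hard one,
# so abundant easy majority-class examples no longer dominate training
easy = focal_loss(0.95, 1)  # confident and correct
hard = focal_loss(0.20, 1)  # confidently wrong
```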

20
Baseline Acute Myeloid Leukemia Prognosis Models using Transcriptomic and Clinical Profiles by Studying the Impacts of Dimensionality Reductions and Gene Signatures on Cox-Proportional Hazard

Sauve, L.; Hebert, J.; Sauvageau, G.; Lemieux, S.

2022-12-10 bioinformatics 10.1101/2022.12.06.519415 medRxiv
Top 0.1%
8.5%

Gene marker extraction to evaluate risk in cancer can refine the diagnosis process and lead to adapted therapies and better survival. These survival analyses can be performed with computer systems and Machine Learning (ML) algorithms such as the Cox-Proportional-Hazard (CPH) model applied to gene expression (GE) RNA-Seq data. However, optimal tuning of the CPH from genome-wide GE data is challenging and so far poorly assessed. In this work we interrogate an Acute Myeloid Leukemia (AML) dataset (Leucegene) to derive the key components of the CPH that drive down its performance and to discover its sensitivity to various factors, in the hope of improving the system. In this study, we compare projection and selection data reduction techniques, mainly PCA and the LSC17 gene signature, in combination with the CPH in an ML framework. Results reveal that the CPH performs better with a combination of clinical and gene expression features. We determine that projections perform better than selections in the absence of clinical information. We ascertain that the CPH is affected by overfitting and that this overfitting is linked to the number and the content of input covariables. We show that PCA, through its ability to learn from the input data directly, complements clinical features and generalizes better than LSC17 on Leucegene. We postulate that projections are preferable to selections on harder tasks such as assessing risk in the intermediate subset of Leucegene. We extrapolate that these findings apply in the more general context of risk detection via machine learning in cancer. We see that higher-capacity models such as CPH-DNN systems can be improved via survival-derived projections and can combat overfitting through heavy regularization. Author summary: This study aims to investigate the feasibility of using gene expression to evaluate risk in cancer, and to compare projection and selection data reduction techniques.
The study used the Leucegene dataset to compare the PCA method and a previously published 17-gene signature, in combination with the Cox-Proportional-Hazard model, in a machine learning framework. Results showed that the CPH was affected by overfitting and that this overfitting was linked to the number and the content of input covariables. The study found that PCA, through its ability to learn from the input data directly, complements clinical features and generalizes better than LSC17 on Leucegene. The study concluded that projections are preferable to selections on harder tasks, such as assessing risk in the intermediate subset of Leucegene, and can be tuned to improve their performance. Data availability statement: Source code for pipelines and algorithms, as well as gene expression matrices, are available at https://github.com/lemieux-lab/dimensions_coxph. Access to the Leucegene cohorts' survival times can be granted upon request and following ethical review.
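The Cox-Proportional-Hazard model at the centre of this comparison is fit by maximizing the partial likelihood. A minimal negative partial log-likelihood for right-censored data might look like the following; the one-covariate toy cohort is hypothetical, and real implementations handle tied event times explicitly (e.g. Breslow or Efron corrections):

```python
import math

def neg_partial_log_likelihood(beta, X, times, events):
    """Negative Cox partial log-likelihood (distinct event times assumed).

    beta: coefficient vector; X: covariate rows;
    times: observed times; events: 1 if the event was observed, 0 if censored.
    """
    def lp(x):  # linear predictor beta . x
        return sum(b * xi for b, xi in zip(beta, x))
    nll = 0.0
    for i, (t, d) in enumerate(zip(times, events)):
        if not d:
            continue  # censored subjects contribute only through risk sets
        # risk set: everyone still under observation at time t
        risk = sum(math.exp(lp(X[j])) for j in range(len(X)) if times[j] >= t)
        nll -= lp(X[i]) - math.log(risk)
    return nll

# Toy cohort: one covariate, higher values die earlier; last subject censored
X = [[2.0], [1.0], [0.5], [0.0]]
times = [1.0, 2.0, 3.0, 4.0]
events = [1, 1, 1, 0]
```

At beta = [0] every subject is equally at risk, so each event term reduces to the log of the risk-set size; a positive beta, which matches the toy data's direction of effect, should lower the loss.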